Understand Deep Generative Models from Ground Up

林祥瑞

Preface

The lecture has a 40-minute time limit and we have lengthy slides to cover. We may skip some pages if we have to :3.

Mathematical Foundations

In this section, we will give a brief introduction to information theory. Then we dive into these concepts that are fundamental to deep generative models:

  • KL-divergence

  • Jensen–Shannon divergence

Information Theory - Recap

  • The amount of information can be understood as the surprise of an event.

  • That is, rare events give us more information: we need more bits to encode such an event in a binary encoding.

  • Suppose we have a fair die, represented by an rv \(X\) with outcomes \(\{1, 2, 3, 4, 5, 6\}\). The amount of information of the event \(X = 3\) is measured by \(I(X = 3) = \log_2 \frac{1}{P(X = 3)} = \log_2 6 \approx 2.58\).
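
As a quick check, the information content of this event can be computed directly (a minimal Python sketch; the function name is illustrative):

```python
import math

def information_content(p: float) -> float:
    """Information content in bits of an event with probability p."""
    return math.log2(1.0 / p)

# A fair die: each face has probability 1/6.
i_roll = information_content(1 / 6)  # log2(6) ≈ 2.58 bits

# Two independent rolls carry twice the information of one roll.
i_two_rolls = information_content((1 / 6) * (1 / 6))
```

Note that `i_two_rolls` equals `2 * i_roll`, matching the additivity property discussed on a later slide.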

Information Theory - Information Content

  • The amount of information can be intuitively understood as the number of perfect queries needed to locate that event.

  • Consider the \(X = 3\) example. We ask three yes-no queries to find the outcome:

    • Is it greater than or equal to 4? No

    • Is it greater than or equal to 2? Yes

    • Is it 3? Yes

  • With perfect yes-no queries, we expect to ask about \(\log_2 \frac{1}{P(X = 3)}\) questions.

Information Theory - Information Content

  • In fact, we need the law of large numbers and more theory to justify this idea rigorously. It is out of scope for this lecture.

  • The information content is additive over independent events. Suppose we roll the die twice. The information content of the joint event \(X_1 = 3, X_2 = 2\) is

\[\begin{aligned} &\log_2 \frac{1}{P(X_1 = 3, X_2 = 2)} \\ &= \log_2 \frac{1}{P(X_1 = 3)} + \log_2 \frac{1}{P(X_2 = 2)} \\ &= I(X_1 = 3) + I(X_2 = 2) \end{aligned}\]

Information Theory - Entropy

  • The (Shannon) entropy of an rv \(X\), denoted \(H(X)\), can be understood as the expected amount of information.

  • Suppose an rv \(X\) and its pdf/pmf \(P(x)\).

\[\begin{aligned} H(X) &= E_{X \sim P}[I(X)] \\ &= \sum_x P(x) \log_b \frac{1}{P(x)} \quad \text{(discrete)} \\ &= \int_x P(x) \log_b \frac{1}{P(x)} \, dx \quad \text{(continuous)} \end{aligned}\]
  • The entropy is usually interpreted as the randomness of a random variable.
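
The definition can be sketched in a few lines of Python (base-2 logarithm, so entropy is measured in bits):

```python
import math

def entropy(pmf):
    """Shannon entropy in bits: H(X) = sum_x P(x) * log2(1 / P(x))."""
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

h_die = entropy([1 / 6] * 6)   # fair die: log2(6) ≈ 2.58 bits
h_coin = entropy([0.9, 0.1])   # biased coin: less random than a fair coin (1 bit)
```

The biased coin illustrates the "randomness" interpretation: the more predictable the variable, the lower its entropy.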

Information Theory - Information Gain

  • What if we encode a true distribution \(P\) with an approximating distribution \(Q\)?

  • Encoding based on \(Q\) can be thought of as an imperfect query for \(P\).

  • The difference \(E_{x \sim P(x)}[\log \frac{1}{Q(x)}] - E_{x \sim P(x)}[\log \frac{1}{P(x)}]\) is always non-negative due to Gibbs' inequality.
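
Gibbs' inequality can be checked numerically: the expected code length under \(Q\)'s code (the cross-entropy) never drops below \(H(P)\). A sketch with hypothetical distributions:

```python
import math

def cross_entropy(p, q):
    """Expected code length in bits when encoding samples from p
    with a code optimized for q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]   # true distribution (hypothetical)
Q = [1/3, 1/3, 1/3]   # encoding distribution (hypothetical)

# The gap equals D_KL(P || Q) and is non-negative by Gibbs' inequality.
gap = cross_entropy(P, Q) - cross_entropy(P, P)
```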

Information Theory - Conditional Entropy

  • Suppose you have the knowledge of rv \(Y\), and would like to know the randomness of rv \(X\). It can be modeled as conditional entropy \(H(X|Y)\).

  • The rationale goes as follows:

    • Given value \(y\), \(H(X | Y = y) = \sum_x P(X = x | Y = y) \log \frac{1}{P(X = x | Y = y)} \)

    • \(H(X|Y) = \sum_y P(y) H (X | Y = y)\) is the expected value over \(y\).
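
The two-step rationale can be verified on a small hypothetical joint table:

```python
import math

def entropy(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

# Hypothetical joint pmf P(X, Y) over X, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# H(X | Y) = sum_y P(y) * H(X | Y = y)
h_x_given_y = sum(
    p_y[y] * entropy([joint[(x, y)] / p_y[y] for x in (0, 1)])
    for y in (0, 1)
)
p_x = [sum(p for (xx, _), p in joint.items() if xx == xv) for xv in (0, 1)]
h_x = entropy(p_x)  # knowing Y can only reduce the randomness of X
```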

Information Theory - Conditional Entropy

  • \(H(X | Y)\) is smaller than or equal to \(H(X)\) because you have prior knowledge of \(Y\).

  • For example, let \(X\) indicate whether the teacher is inside R442, and \(Y\) whether you observe the teacher walk into R442.

  • \(H(X) \gt 0\) because you cannot be sure of the teacher's presence. \(H(X|Y)\) is strictly less than \(H(X)\) because you know the outcome of \(X\) whenever \(Y = \text{true}\).

  • \(H(X) = H(X|Y)\) if \(X\) and \(Y\) are independent.

Information Theory - Mutual Information

  • The dependence of rvs \(X, Y\) is modeled as mutual information \(I(X; Y)\).

  • In fact,

\[\begin{aligned} I(X; Y) \\ = H(X) - H(X | Y) \\ = H(Y) - H(Y | X) \\ = H(X) + H(Y) - H(X, Y) \\ = H(X, Y) - H(X|Y) - H(Y|X) \end{aligned}\]
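
The identities above can be checked numerically on a hypothetical joint table:

```python
import math

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

# Hypothetical joint pmf with uniform marginals.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x, p_y = [0.5, 0.5], [0.5, 0.5]

# Direct definition: I(X;Y) = sum P(x,y) * log2( P(x,y) / (P(x)P(y)) )
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())

# Identity check: I(X;Y) = H(X) + H(Y) - H(X, Y)
h_xy = H(list(joint.values()))
assert abs(mi - (H(p_x) + H(p_y) - h_xy)) < 1e-9
```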

Information Theory - Mutual Information

Kullback–Leibler Divergence

  • Suppose two probability distributions \(P\) and \(Q\). The Kullback–Leibler divergence of \(P\) with respect to \(Q\) is formulated as:

\[D_{\text{KL}} (P \parallel Q) = E_{x \sim P}[\log \frac{1}{Q(x)}] - E_{x \sim P}[\log \frac{1}{P(x)}]\]

Kullback–Leibler Divergence - Interpretations

  • \(P\) is regarded as true probability, and \(Q\) is regarded as theory.

  • In coding theory, it measures the expected number of extra bits needed to encode samples from \(P\) using a code optimized for \(Q\).

  • In Bayesian inference, \(Q\) is prior and \(P\) is posterior. In detail,

    • \(Q = p(\theta)\) the probability over our model parameters \(\theta\).

    • \(P = p(\theta | x)\), given a new observation \(x\).

    • The divergence is the information gain from revising the belief \(Q\) to \(P\).

Kullback–Leibler Divergence & Jensen–Shannon Divergence

  • KL-div is not symmetric! \(D_{\text{KL}}(P \| Q)\) may not be the same as \(D_{\text{KL}}(Q \| P)\).

  • Jensen–Shannon divergence fixes this by

\[D_{\text{JS}} (P \| Q) = \frac{1}{2} D_{\text{KL}} (P \| M) + \frac{1}{2} D_{\text{KL}} (Q \| M)\]

where \(M = \frac{1}{2}(P + Q)\)
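
A small sketch showing that the construction is symmetric, unlike KL (the distributions are hypothetical):

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture M = (P + Q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P, Q = [0.9, 0.1], [0.5, 0.5]
print(kl(P, Q), kl(Q, P))   # different values: KL is not symmetric
print(js(P, Q), js(Q, P))   # identical: JS is symmetric
```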

Kullback–Leibler Divergence & Jensen–Shannon Divergence

Remarks

  • KL-divergence should never be misnamed as KL-metric or KL-distance: since it is not symmetric, it does not satisfy the mathematical definition of a metric.

  • \(D_{\text{KL}}(P \| Q)\) could be infinite if \(P(x) \gt 0\) and \(Q(x) = 0\) for some \(x\).

  • Both KL-div and JS-div ignore the underlying metric on the support, leading to vanishing gradients.

  • We will introduce the Wasserstein metric to resolve these issues.

Conclusion

  • Understand basics of information theory and intuition behind KL-divergence.

  • Know the Jensen–Shannon divergence.

Variational Autoencoder (VAE)

  • Vanilla autoencoder and noise autoencoder

  • Gaussian mixture model

  • Evidence lower bound (ELBO)

Autoencoder

  • Intuition: Given data sample \(x\), encode it into latent space code \(m\). Then decode it into reconstruction \(x'\).

  • Trained by minimizing the difference between the input and the reconstruction, \(L(x, x')\), usually an L2 or cross-entropy loss.

  • We can then represent \(x\) by the lower-dimensional code \(m\).

autoencoder

Noisy Autoencoder

  • The autoencoder may not be robust to slight changes in the input \(x\).

  • One solution is to add some noise \(z\) to the input \(x\).

noisy autoencoder

Deriving VAE

  • Can we slightly change the code \(m\) to \(m'\), and generate new reasonable sample by decoding \(m'\)?

  • In fact, it does not work as expected: both the encoder and the decoder are non-linear, so we cannot expect the latent space to have such a nice property.

  • Solution: add some noise \(z\) to the latent code \(m\).

Deriving VAE

VAE - Design

In VAE, we add a learned noise to the latent code: \(c_x = m_x + (\mu_x + e \cdot \exp(\sigma_x)) = m_x + z_x\).

  • \(x\): the input sample

  • \(m_x\): vector of latent space code

  • \(\mu_x\): learned mean (李宏毅's version omits this term)

  • \(\sigma_x\): learned log-variance

  • \(\exp(\sigma_x)\): noise variance; the exponential is necessary to avoid negative values from the neural network

  • \(e\): the unit Normal noise

  • \(z_x\): the learned noise
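
The construction above can be sketched directly (NumPy; the latent values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_learned_noise(m, mu, sigma, rng):
    """c = m + (mu + e * exp(sigma)) = m + z, with e ~ N(0, I),
    following the slide's design."""
    e = rng.standard_normal(m.shape)   # unit Normal noise
    z = mu + e * np.exp(sigma)         # learned noise z_x
    return m + z

m = np.zeros(4)           # latent code from the encoder (hypothetical)
mu = np.zeros(4)          # learned mean
sigma = np.full(4, -1.0)  # exp(sigma) stays positive regardless of sign
c = add_learned_noise(m, mu, sigma, rng)
```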

VAE - Design

VAE - Neural Network Perspective

VAE in neural network perspective

  • \(q_\theta (z | x)\) is probability function of encoder

  • \(p_\phi (x | z)\) is probability function of decoder

VAE - Neural Network Perspective

  • We add a noise regularizer term \(D_{\text{KL}} (q_\theta(z | x) \| p(z))\) to the loss, where

    • \(q_\theta(z|x)\) is the normal distribution \(\mathcal{N}(\mu_x, \exp(\sigma_x))\) given by input \(x\)

    • \(p(z)\) is the unit normal \(\mathcal{N}(0, 1)\)

  • In implementation, the resulting loss is \(\text{ReconstructionLoss}(x, x') + D_{\text{KL}} (q_\theta(z | x) \| p(z))\)

  • However, the loss should be \(E_{z \sim q_\theta (z|x)}[\log \frac{1}{p_\phi (x | z)}] + D_{\text{KL}} (q_\theta(z | x) \| p(z))\). We do not adopt this form due to implementation difficulty.
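
With \(q_\theta(z|x) = \mathcal{N}(\mu_x, \exp(\sigma_x))\) and \(p(z) = \mathcal{N}(0, 1)\), the KL regularizer has a closed form. A minimal sketch (treating \(\sigma_x\) as the log-variance, as on the slide; the toy values and L2 reconstruction are illustrative):

```python
import numpy as np

def kl_to_unit_normal(mu, sigma):
    """Closed-form D_KL( N(mu, exp(sigma)) || N(0, 1) ), summed over
    dimensions, with sigma interpreted as the log-variance."""
    return 0.5 * np.sum(mu**2 + np.exp(sigma) - sigma - 1)

def vae_loss(x, x_recon, mu, sigma):
    recon = np.sum((x - x_recon) ** 2)  # L2 reconstruction loss
    return recon + kl_to_unit_normal(mu, sigma)

x, x_recon = np.array([1.0, 0.0]), np.array([0.9, 0.1])
mu, sigma = np.zeros(3), np.zeros(3)  # q matches the prior: KL term is 0
loss = vae_loss(x, x_recon, mu, sigma)
```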

VAE - Inspiration from Gaussian mixture model

VAE - Mathematical Perspective

I found two ways to interpret this model.

We adopt 李宏毅’s version here.

VAE - Mathematical Perspective

  • The distribution of the evidence term \(P(x)\) is fixed but intractable to compute.

  • We can approximate it by maximizing the likelihood \(\tilde{P}(x) = \tilde{P}(x_1) \tilde{P}(x_2) \cdots \tilde{P}(x_n)\) over all sampled data \(x_i\). Note that \(P(x)\) cannot be known; \(\tilde{P}\) is our parameterized function.

  • In practice, we maximize the log likelihood \(\log \tilde{P} (x) = \log \tilde{P} (x_1) + \log \tilde{P} (x_2) + \cdots + \log \tilde{P} (x_n)\)

VAE - Mathematical Perspective

VAE - Mathematical Perspective

Conclusion

  • Know the design of VAE.

  • Understand the theory foundation of VAE.

Generative Adversarial Network (GAN)

  • Recap on vanilla GAN

  • Understanding Wasserstein metric

GAN - Inspiration

  • In the context of VAE, we compute the reconstruction error using a hand-crafted function.

  • Why not let the model learn to discriminate the differences?

GAN - Model Design

GAN - Model Design

  • We randomly draw \(z\) from latent space.

  • The generator outputs fake samples \(\tilde{x} = G(z)\)

  • The discriminator learns to distinguish between true sample \(x\) from dataset and fake samples \(\tilde{x}\) from generator.

  • The discriminator outputs a value between 0 and 1, indicating whether the input is a true sample.

GAN - Formulation

  • Suppose the distributions

    • \(p_z\): distribution over noise input \(z\), usually uniform

    • \(p_g\): the distribution of generator over data \(\tilde{x}\)

    • \(p_r\): the distribution over real sample \(x\)

  • GAN can be formulated as a minimax game with game value \(\min_G \max_D L(D, G)\).

    • The generator minimizes the value.

    • The discriminator maximizes the value.

GAN - Formulation

  • The game value is defined as

\[L(D, G) = E_{x \sim p_r(x)} [\log D(x)] + E_{z \sim p_z(z)} [\log (1 - D(G(z)))] \\ = E_{x \sim p_r(x)} [\log D(x)] + E_{\tilde{x} \sim p_g(\tilde{x})} [\log (1 - D(\tilde{x}))]\]
  • Theoretically, the value settles at a Nash equilibrium (though in practice it often does not).

GAN - Training

We repeat this loop to train the generator and discriminator:

  1. Unfreeze the generator and freeze the discriminator.

  2. Train the generator weights by feeding random latent codes \(z\). Store the generated images in the meantime.

  3. Freeze the generator and unfreeze the discriminator.

  4. Feed real data samples and generated (fake) data samples to train the discriminator.
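
The four-step loop can be sketched on a toy 1-D problem. Everything here is an illustrative choice, not the canonical setup: pure NumPy, a linear generator \(G(z) = az + b\), a logistic discriminator \(D(x) = \text{sigmoid}(wx + c)\), data from \(\mathcal{N}(3, 1)\), and the non-saturating generator loss with hand-derived gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for _ in range(300):
    # Steps 1-2: update the generator, discriminator frozen.
    z = rng.uniform(-1, 1, batch)
    d_fake = sigmoid(w * (a * z + b) + c)
    # Ascend E[log D(G(z))]: d/da = (1 - D) * w * z, d/db = (1 - D) * w.
    a += lr * np.mean((1 - d_fake) * w * z)
    b += lr * np.mean((1 - d_fake) * w)

    # Steps 3-4: update the discriminator, generator frozen.
    x_real = rng.normal(3.0, 1.0, batch)
    x_fake = a * z + b
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    # Ascend E[log D(x)] + E[log(1 - D(x_fake))].
    w += lr * np.mean((1 - d_real) * x_real - d_fake * x_fake)
    c += lr * np.mean((1 - d_real) - d_fake)

print(f"generated mean ≈ {b:.2f}, data mean = 3.0")
```

The generator starts at mean 0 and is pushed toward the data distribution as the discriminator learns to separate real from fake.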

GAN Variants - MSG-GAN

GAN Variants - Progressive GAN

GAN Variants - CycleGAN

GAN Variants - CycleGAN

GAN Variants - CycleGAN

GAN Variants - The GAN Zoo

Someone compiled the published GANs in this list.

Problem of GAN - Hard to Reach Nash Equilibrium

  • In a GAN training session, each player (generator and discriminator) updates its own cost without respecting the other player in the game.

  • It has been shown that convergence to a Nash equilibrium is not guaranteed in a two-player non-cooperative game.

Problem of GAN - Hard to Reach Nash Equilibrium

  • The first player minimizes \(f_1(x, y) = xy\) over \(x\), while the second minimizes \(f_2(x, y) = -xy\) over \(y\).

  • The values oscillate when each player minimizes its own objective independently.

non nash equilibrium
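
The oscillation is easy to reproduce: under simultaneous gradient descent, the iterates of this game circle the equilibrium \((0, 0)\) and drift outward (a minimal NumPy sketch):

```python
import numpy as np

x, y, lr = 1.0, 1.0, 0.1
radii = []
for _ in range(100):
    gx, gy = y, -x                   # d(xy)/dx = y, d(-xy)/dy = -x
    x, y = x - lr * gx, y - lr * gy  # simultaneous updates
    radii.append(np.hypot(x, y))     # distance from the equilibrium (0, 0)

# Each step multiplies the squared radius by (1 + lr^2): the players
# spiral away from the Nash equilibrium instead of converging to it.
```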

Problem of GAN - Low Dimensional Supports

low dim manifold

We omit the explanation here since it requires mathematical background.

Problem of GAN - Vanishing Gradient

Problem of GAN - Vanishing Gradient

Problem of GAN - Vanishing Gradient

\[\begin{aligned} D_{KL}(P \| Q) &= \sum_{x=0, y \sim U(0, 1)} 1 \cdot \log\frac{1}{0} = +\infty \\ D_{KL}(Q \| P) &= \sum_{x=\theta, y \sim U(0, 1)} 1 \cdot \log\frac{1}{0} = +\infty \\ D_{JS}(P, Q) &= \frac{1}{2}\Big(\sum_{x=0, y \sim U(0, 1)} 1 \cdot \log\frac{1}{1/2} + \sum_{x=\theta, y \sim U(0, 1)} 1 \cdot \log\frac{1}{1/2}\Big) = \log 2\\ W(P, Q) &= |\theta| \end{aligned}\]
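
A 1-D analogue of this example can be checked empirically, assuming \(P = U(0, 1)\) and \(Q = U(\theta, \theta + 1)\), which likewise have disjoint supports. In one dimension, the Wasserstein-1 distance is the mean gap between sorted samples:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
theta = 5.0
p = np.sort(rng.uniform(0.0, 1.0, 10_000))          # samples from P
q = np.sort(rng.uniform(theta, theta + 1, 10_000))  # samples from Q

# Empirical Wasserstein-1 distance: mean gap between sorted samples.
w1 = np.mean(np.abs(p - q))  # tracks theta

# JS divergence from histograms over shared bins; the supports are disjoint,
# so every occupied bin contributes log 2 and the total saturates at log 2.
bins = np.linspace(-1.0, theta + 2.0, 200)
hp = np.histogram(p, bins=bins)[0] / len(p)
hq = np.histogram(q, bins=bins)[0] / len(q)
m = (hp + hq) / 2
kl = lambda u, v: sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)
js = 0.5 * kl(hp, m) + 0.5 * kl(hq, m)  # constant in theta
```

W1 scales with \(\theta\) and therefore provides a useful gradient, while JS is stuck at \(\log 2\) for every \(\theta \neq 0\).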

Problem of GAN - Mode Collapse

Wasserstein GAN - Introduction

  • Wasserstein GAN (WGAN) is a variant of GAN that adopts the Wasserstein metric, also called the Earth Mover's distance.

  • It has no sign of mode collapse in experiments.

  • Wasserstein distance can be understood as the minimum effort to move the piles from distribution \(P\) to distribution \(Q\).

Wasserstein GAN - Wasserstein Metric Example

Wasserstein GAN - Wasserstein Metric Example

Wasserstein GAN - Formulate WGAN

The transport plan can be thought of as a joint probability between the two distributions, where \(\Pi(p_r, p_g)\) is the collection of joint distributions whose marginals are \(p_r\) and \(p_g\).

\[W(p_r, p_g) = \inf_{\gamma \sim \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\| x-y \|]\]

However, the formula above is computationally intractable. The authors of WGAN proposed a formula based on the Kantorovich–Rubinstein duality.

\[W(p_r, p_g) = \frac{1}{K} \sup_{\| f \|_L \leq K} \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]\]

Wasserstein GAN - Formulate WGAN

\[W(p_r, p_g) = \frac{1}{K} \sup_{\| f \|_L \leq K} \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{\tilde{x} \sim p_g}[f(\tilde{x})]\]
  • \(\| f \|_L \leq K\) means \(f\) is K-Lipschitz continuous: \(\lvert f(x_1) - f(x_2) \rvert \leq K \lvert x_1 - x_2 \rvert\).

  • The generated samples \(\tilde{x}\) can be replaced with the generator function \(\tilde{x} = g_\theta(z)\), while \(f\) acts as the discriminator.

  • Based on the revised Wasserstein metric, WGAN maximizes the metric \(W\) while making sure \(f\) is K-Lipschitz continuous.

Wasserstein GAN - Training

Wasserstein GAN - Training

  • It maintains Lipschitz continuity by clipping the weights! This has been shown to cause side effects.

  • Many improvements have been done, especially WGAN-GP.

  • No time for further explanation in this lecture. Here is the link for your interest.
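
A minimal sketch of the clipping step (NumPy; the parameter list is hypothetical, and the threshold c = 0.01 follows the original WGAN paper's default):

```python
import numpy as np

def clip_weights(params, c=0.01):
    """Crudely bound the critic's Lipschitz constant by forcing every
    weight into [-c, c] after each update. WGAN-GP replaces this with
    a gradient penalty."""
    return [np.clip(w, -c, c) for w in params]

# Hypothetical critic weight matrices after a gradient step.
params = [np.array([[0.5, -0.002], [0.03, -0.9]]), np.array([0.2, -0.015])]
params = clip_weights(params)
```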

Wasserstein GAN - Variants

Conclusion

  • Comprehend the design of GANs and their variants.

  • Know the pros and cons of WGAN.

Q & A